We introduce a method to convert an ensemble of sequences of symbols into a weighted directed network
whose nodes are motifs, while the directed links and their weights are defined from statistically significant
co-occurrences of two motifs in the same sequence. The analysis of communities of networks of motifs is shown to
be able to correlate sequences with functions in the human proteome database, to detect hot topics from online
social dialogs, and to characterize trajectories of dynamical systems, and it might find other useful applications in
processing large amounts of data in various fields.
There are many examples in biology, in linguistics and in the theory of dynamical systems, where
information resides in, and has to be extracted from, corpora of raw data consisting of sequences of
symbols. For instance, a written text in English or in another language is a collection of sentences,
each sentence being a sequence of letters from a given alphabet. Not all sequences of letters are
possible, since the sentences are built on a lexicon of a certain number of words. In addition to this,
different words are used together in a structured and conventional way [S1, S2, S3]. Similarly, in
biology, DNA nucleotide or amino acid sequence data can be seen as corpora of strings [S4]. For example,
it is well known that proteomes are far from being a random assembly of peptides, since clustering of
amino acids and strong correlations among proteomic segments have been clearly demonstrated. These
results give meaning to the metaphor of protein sequences regarded as texts written in a still unknown
language. Sequences of symbols can also be found in time series generated by dynamical systems. In
fact, a trajectory in phase space can be transformed into a sequence of symbols by the so-called
"symbolic dynamics" approach [10]. The basic idea is to partition phase space into a finite number of
regions, each of which is labelled with a different symbol. In this way, each initial condition gives
rise to a sequence of symbols representing the initial cell, the cell occupied at the first iterate, the
cell occupied at the second iterate, and so forth.

In all the examples mentioned above, the main challenge is to decipher the message contained in the
corpora of data sequences, and to infer the underlying rules that govern their production. In order to
do this, one needs: i) to detect the fundamental units carrying information, like words do in language,
and ii) to study their combination syntax in the ensemble of sequences. In fact, information in its
general meaning is located not only at the level of strings, but also in their correlation patterns
[11, 12]. In this Letter, we introduce a method to transform a generic corpus of strings, such as
written texts, protein sequence data, sheet music, or a collection of dance movement sequences [13],
into a network representing the significant and fundamental units of the original message together with
their relationships. The method relies on a statistical procedure to detect patterns carrying relevant
information, and works as follows. We first construct a dictionary of the recurrent strings of $k$
letters, called $k$-motifs. Recurrent strings play, in this more general context, the same role as words
in written or spoken languages. We then construct a $k$-motif network, a graph in which each node is one
entry of the dictionary, and a directed arc between two nodes is drawn when the ordered co-occurrence of
the two motifs is statistically significant in the dataset analyzed. We will show how the analysis of
topological properties of networks of $k$-motifs, such as the detection of community structures
[14, 15], allows one to extract important information encoded in the original data. In particular, we
will consider the application of the method to datasets in three different domains, namely, biological
sequences of proteins, messages from online social networks, and sequences of symbols generated by the
trajectories of a dynamical system.

Let us consider an ensemble $\mathcal{S}$ of $S$ sequences of symbols. Each sequence $s$
($s = 1, 2, \ldots, S$) is a string of letters from an alphabet $\mathcal{A}$ of $A$ letters,
$\mathcal{A} = \{\sigma_1, \sigma_2, \ldots, \sigma_A\}$. In general, the strings can have different
lengths. We indicate by $l_s$ the length of sequence $s$, and by $L = \sum_{s=1}^{S} l_s$ the total
length of the ensemble. An example is provided by proteomes. A proteome is a collection of
$S \approx 10^4$ proteins of a species. Each protein is a sequence of length $l_s$, ranging from $10^2$
to $10^3$, made of symbols from an alphabet $\mathcal{A}$ with $A = 20$ letters,
$\mathcal{A} = \{\sigma_1, \sigma_2, \ldots, \sigma_{20}\}$, where each $\sigma$ labels one of the amino
acids a protein can be made of. We define as $k$-string a segment of $k$ contiguous letters
$x_1 x_2 \cdots x_k$, where $x_i \in \mathcal{A}\ \forall i$. The number of all possible $k$-strings is
$A^k$, while from the ensemble of sequences $\mathcal{S}$ we can select only $L - S \cdot (k-1)$
overlapping $k$-strings, so that some of the possible $k$-strings do not occur, some occur once, and
others more than once, either in the same or in different sequences of symbols. We define as:

$$p^{obs}(x_1 x_2 \cdots x_k) = \frac{c(x_1 x_2 \cdots x_k)}{\sum_{(x_1, x_2, \ldots, x_k) \in \mathcal{A}^k} c(x_1 x_2 \cdots x_k)} \tag{1}$$
the observed probability of a string $x_1 x_2 \cdots x_k$. This probability is obtained by counting the
total number of times, $c(x_1 x_2 \cdots x_k)$, the string actually occurs in the sequences of the
ensemble. To assess the statistical significance of the string, the probability in Eq. (1) has to be
compared with the expected probability $p^{exp}(x_1 x_2 \cdots x_k)$ of the string occurrence. The
latter can be evaluated under different assumptions. In fact, the joint probability
$p(x_1 x_2 \cdots x_k)$ can be written as:

$$p(x_1 x_2 \cdots x_k) = p(x_1 x_2 \cdots x_{k-1})\, p(x_k | x_1 x_2 \cdots x_{k-1}),$$

and different approximations for the conditional probabilities $p(x_k | x_1 x_2 \cdots x_{k-1})$ lead
to different values of the expected probability $p^{exp}(x_1 x_2 \cdots x_k)$. Namely, if we assume that
the occurrence of a letter does not depend on any of the previous letters, i.e.
$p(x_k | x_1 x_2 \cdots x_{k-1}) = p(x_k)$, the expected probability is simply given by the product of
the relative frequencies of the string's component letters:
$p^{exp}(x_1 x_2 \cdots x_k) = p^{obs}(x_1) \cdots p^{obs}(x_k)$.
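
As an illustration, the counting behind Eq. (1) and the independent-letter approximation can be sketched in a few lines of Python. This is only a minimal sketch: the function names and the three-sequence toy ensemble are hypothetical and are not part of the original analysis.

```python
from collections import Counter
from math import prod


def count_kstrings(sequences, k):
    """Count every overlapping k-string, never crossing sequence boundaries."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def observed_probabilities(sequences, k):
    """Eq. (1): counts normalised by the total number of observed k-strings."""
    counts = count_kstrings(sequences, k)
    total = sum(counts.values())
    return {string: c / total for string, c in counts.items()}


def expected_independent(string, p1):
    """Independent-letter approximation: product of single-letter frequencies."""
    return prod(p1[letter] for letter in string)


# Hypothetical toy ensemble: S = 3 sequences, total length L = 12.
toy = ["ABAB", "ABBA", "BABA"]
p1 = observed_probabilities(toy, 1)
p2 = observed_probabilities(toy, 2)

# The number of overlapping 2-strings is L - S*(k-1) = 12 - 3 = 9.
print(sum(count_kstrings(toy, 2).values()))
print(p2["AB"], expected_independent("AB", p1))
```

In this toy ensemble the string AB is observed with probability 4/9, against an expected value of 1/4 under the independent-letter assumption.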

By using instead a first order Markov approximation, i.e.
$p(x_k | x_1 x_2 \cdots x_{k-1}) = p(x_k | x_{k-1})$, the expected probability can be expressed in the
form: $p^{exp}(x_1 x_2 \cdots x_k) = p^{obs}(x_1)\, p^{obs}(x_2 | x_1) \cdots p^{obs}(x_k | x_{k-1})$,
where $p^{obs}(x_j | x_i)$ is extracted from the counts as:
$p^{obs}(x_j | x_i) = c(x_i x_j) / \sum_{x_j} c(x_i x_j) = p^{obs}(x_i x_j) / p^{obs}(x_i)$. This latter
assumption is based on the fact that there is a minimal amount of memory in the sequence: a symbol of
the sequence is correlated to the previous one only. Here, we go beyond the approximation of Markov
chains of order 1, by retaining as much memory as possible [S4]. We assume:

$$p^{exp}(x_1 x_2 \cdots x_k) = p^{obs}(x_1 x_2 \cdots x_{k-1}) \cdot p^{obs}(x_k | x_2 \cdots x_{k-1}) \tag{2}$$



where the conditional probabilities can be evaluated from counts as:

$$p^{obs}(x_k | x_2 \cdots x_{k-1}) = \frac{c(x_2 x_3 \cdots x_k)}{\sum_{x_k} c(x_2 x_3 \cdots x_k)} \tag{3}$$

or can be expressed in terms of the observed probability for shorter sequences as:

$$p^{obs}(x_k | x_2 \cdots x_{k-1}) = \frac{p^{obs}(x_2 \cdots x_k)}{p^{obs}(x_2 \cdots x_{k-1})} \tag{4}$$



By using the latter expression, we can finally write the expected probabilities in a more compact form:

$$\begin{aligned}
p^{exp}(x_1) &= p^{obs}(x_1) \\
p^{exp}(x_1 x_2) &= p^{obs}(x_1)\, p^{obs}(x_2) \\
p^{exp}(x_1 x_2 x_3) &= p^{obs}(x_1 x_2)\, \frac{p^{obs}(x_2 x_3)}{p^{obs}(x_2)} \\
&\;\;\vdots \\
p^{exp}(x_1 x_2 \cdots x_k) &= p^{obs}(x_1 \cdots x_{k-1})\, \frac{p^{obs}(x_2 \cdots x_k)}{p^{obs}(x_2 \cdots x_{k-1})}
\end{aligned} \tag{5}$$



This way, the expected probability of a given $k$-string is evaluated based on observations for strings
of up to $(k-1)$ symbols. Therefore, by predicting the probability of appearance with a high order
Markov model, our method allows us to highlight the true $k$-body correlations, subtracting from them
the effects due to $(k-1)$ and lower order correlations. Based on observed and expected probabilities, a
test of statistical significance, for instance a Z-score, is then performed for each $k$-string. We
define as $k$-motifs, or recurrent $k$-strings, the statistically relevant strings whose observed and
expected numbers of occurrences satisfy the statistical test adopted, and we indicate as $Z_k$ the
dictionary composed of all the selected $k$-motifs.
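
The selection of $k$-motifs can be sketched as follows. The expected probability follows the maximal-order Markov form of Eq. (2), which for $k = 2$ reduces to the product of single-letter frequencies. The Z-score itself is an assumption of this sketch: the text does not specify the variance, so a binomial standard deviation is used here, and the threshold `z_min` and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kstring_probs(sequences, k):
    """Return (counts, observed probabilities) of all overlapping k-strings."""
    counts = Counter(seq[i:i + k] for seq in sequences
                     for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return counts, {s: c / total for s, c in counts.items()}


def expected_prob(x, probs):
    """p_exp(x1..xk) = p_obs(x1..x_{k-1}) * p_obs(x2..xk) / p_obs(x2..x_{k-1}).

    `probs[j]` maps j-strings to their observed probabilities.
    """
    k = len(x)
    if k == 1:
        return probs[1].get(x, 0.0)
    if k == 2:
        return probs[1].get(x[0], 0.0) * probs[1].get(x[1], 0.0)
    denom = probs[k - 2].get(x[1:-1], 0.0)
    if denom == 0.0:
        return 0.0
    return probs[k - 1].get(x[:-1], 0.0) * probs[k - 1].get(x[1:], 0.0) / denom


def find_motifs(sequences, k, z_min=2.0):
    """Select k-strings whose count exceeds expectation by z_min standard deviations.

    Assumption: counts are treated as binomial with n trials and probability p_exp.
    """
    probs = {j: kstring_probs(sequences, j)[1] for j in range(1, k)}
    counts, _ = kstring_probs(sequences, k)
    n = sum(counts.values())
    motifs = {}
    for x, c in counts.items():
        p = expected_prob(x, probs)
        if 0.0 < p < 1.0:
            z = (c - n * p) / sqrt(n * p * (1.0 - p))
            if z > z_min:
                motifs[x] = z
    return motifs


# Hypothetical toy ensemble in which "AB" always appears as a block.
print(find_motifs(["CCCCABCCCC"] * 10, 2))
```

In the toy ensemble only the string AB passes the test, since A and B are individually rare but always adjacent.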

Once we have constructed a lexicon of fundamental units, the next goal is to represent in a graph the
way they are combined together. Recurrent $k$-strings can be distributed differently along the
sequences: they can appear in a single sequence or in more than one sequence, alone or in clusters. To
extract the non-trivial patterns of correlated appearance of $k$-motifs, we need to evaluate the
probability for the random co-occurrence of two motifs, when these are uncorrelated. We first estimate
the expected probability that motif $X$ is followed by motif $Y$ within a generic sequence of the
ensemble $\mathcal{S}$, then we sum over all the sequences of $\mathcal{S}$. We denote as $p(X)$ and
$p(Y)$ the probabilities of finding the two motifs in $\mathcal{S}$. We assume that the two motifs
cannot overlap. In sequence $s$, motif $X$ can then occupy positions ranging from the first to the
$(l_s - 2k + 1)$th site, where $l_s$ is the length of $s$ and $k$ is the length of the motif. For each
fixed position $i$ of $X$ on $s$, with $i = 1, \ldots, (l_s - 2k + 1)$, there are $(l_s - 2k + 2 - i)$
possibilities for $Y$ to appear in the sequence. Hence, the number of expected co-occurrences of $X$ and
$Y$ within $s$ is given by: $\sum_{i=1}^{l_s - 2k + 1} (l_s - 2k + 2 - i)\, p(X) p(Y)$. In order to
obtain the expected number of co-occurrences, we have to sum over all the sequences in the ensemble
$\mathcal{S}$. We finally get:

$$N^{exp}(Y|X) = p(X) p(Y) \sum_{s=1}^{S} \sum_{i=1}^{l_s - 2k + 1} (l_s - 2k + 2 - i) = \frac{1}{2}\, p(X) p(Y) \sum_{s=1}^{S} (l_s - 2k + 1)(l_s - 2k + 2) \tag{6}$$
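
The closed form on the right-hand side of Eq. (6) can be checked numerically. The sketch below, with hypothetical function names, compares it with a brute-force enumeration of all ordered placements in which $Y$ starts at least $k$ sites after the start of $X$, so that the two motifs do not overlap.

```python
def n_pairs(length, k):
    """Closed form of Eq. (6) for one sequence: (l - 2k + 1)(l - 2k + 2) / 2."""
    n = length - 2 * k
    return (n + 1) * (n + 2) // 2 if n >= 0 else 0


def n_pairs_bruteforce(length, k):
    """Enumerate start sites i of X and j of Y (1-based) with j >= i + k."""
    last_start = length - k + 1
    return sum(1 for i in range(1, last_start + 1)
               for j in range(i + k, last_start + 1))


def expected_cooccurrences(p_x, p_y, lengths, k):
    """N_exp(Y|X) of Eq. (6) for an ensemble with the given sequence lengths."""
    return p_x * p_y * sum(n_pairs(length, k) for length in lengths)


# The two counts agree for every sequence length and motif length tried.
print(all(n_pairs(l, k) == n_pairs_bruteforce(l, k)
          for l in range(1, 30) for k in range(1, 6)))
print(expected_cooccurrences(0.1, 0.2, [10, 6], 3))
```

For sequences shorter than $2k$ no non-overlapping placement exists, which is why the helper returns zero in that case instead of evaluating the polynomial.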

For each value of $k$, we are now able to construct the $k$-motif network of the ensemble
$\mathcal{S}$, i.e. a directed network whose nodes are the motifs in the dictionary $Z_k$, and in which
an arc points from node $X$ to node $Y$ if the number of times $Y$ follows $X$ in the ensemble of
sequences is statistically significant. Furthermore, a weight can be associated with the arc from $X$ to
$Y$, based on the extent to which the co-occurrence of the two motifs deviates from expectation.

This approach is able to represent the correlation patterns encrypted in the ensemble of sequences as a
single object, the $k$-motif network. Graph theory then allows one to extract information from the
structural properties of the network, and to retrieve the main message encoded in the original
sequences.
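
The construction of the network can be sketched as follows. Several choices here are assumptions made for illustration only: every ordered, non-overlapping pair of occurrences counts as one co-occurrence, $p(X)$ is estimated as the relative frequency of the motif among all $k$-strings, and an arc is kept when the ratio between observed and expected co-occurrences exceeds a hypothetical `threshold`, with that ratio used as the weight.

```python
from collections import Counter, defaultdict


def motif_positions(seq, motifs, k):
    """Map each motif found in `seq` to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        s = seq[i:i + k]
        if s in motifs:
            positions[s].append(i)
    return positions


def build_motif_network(sequences, motifs, k, threshold):
    """Return {(X, Y): weight} for ordered motif pairs that co-occur significantly."""
    total = sum(max(len(s) - k + 1, 0) for s in sequences)  # number of k-strings
    occurrences = Counter()
    n_obs = Counter()
    for seq in sequences:
        pos = motif_positions(seq, motifs, k)
        for x, px in pos.items():
            occurrences[x] += len(px)
            for y, py in pos.items():
                # Y follows X without overlapping it.
                n_obs[(x, y)] += sum(1 for i in px for j in py if j >= i + k)
    # Number of admissible placements, summed over sequences as in Eq. (6).
    slots = sum((len(s) - 2 * k + 1) * (len(s) - 2 * k + 2) // 2
                for s in sequences if len(s) >= 2 * k)
    edges = {}
    for (x, y), n in n_obs.items():
        n_exp = (occurrences[x] / total) * (occurrences[y] / total) * slots
        if n > 0 and n_exp > 0 and n / n_exp > threshold:
            edges[(x, y)] = n / n_exp
    return edges


# Hypothetical toy ensemble: "DE" always follows "AB" in the same sequence.
print(build_motif_network(["ABCCCDE"] * 3, {"AB", "DE"}, 2, threshold=2.0))
```

The result is a dictionary of weighted arcs, which can be passed directly to any graph library or to the simple routines used for component analysis.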
FIG. 1. The 3-motif network of the human proteome. Nodes belonging to the same community are labelled
by the same number and share the same colour. Most of the communities can be associated with a
functional domain, as described in Table I of [22].



In particular, it is interesting to study the components of the $k$-motif network or, if the graph is
connected, its community structures, i.e. those groups of nodes tightly connected among themselves and
weakly linked to the rest of the graph [15].
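
As a minimal sketch of the first of these analyses, the weakly connected components of a directed network can be obtained with a union-find structure that ignores arc direction. The function name and the toy graph are hypothetical, and the detection of communities inside a connected component, which requires a dedicated clustering algorithm, is not covered here.

```python
def weak_components(nodes, edges):
    """Group nodes into weakly connected components (arc direction is ignored)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in edges:
        parent[find(x)] = find(y)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Hypothetical toy network with one isolated node, "f".
components = weak_components("abcdef", [("a", "b"), ("c", "b"), ("d", "e")])
print(sorted(sorted(c) for c in components))
```

Isolated nodes appear as components of size one, so the number of components reported by such a routine includes them unless they are filtered out.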

In the following, we will consider the application of the method to three different datasets, belonging
to three contexts as diverse as biology, social dialogs and dynamical systems. We will show how the
community analysis of the related $k$-motif networks enables one to extract functional domains in
proteomes, social cascades and hot topics in Twitter, and the increase of chaoticity in deterministic
maps.

In the biological context, many methods based on strings deviating from expectancy in a genome
[S4, S5] or in a proteome [S6] have already been used to make functional deductions. Although they
provide insight into many biological mechanisms [S7], this approach turns out not to be sufficient for a
complete and exhaustive interpretation of the genomic and proteomic message. A fundamental key to its
comprehension is in fact hidden in the correlations among recurrent patterns of strings, which are
naturally represented at a global scale in terms of $k$-motif networks. Various features of these
correlations translate into structural properties of $k$-motif networks. In Fig. 1 we illustrate, as an
example, the 3-motif graph derived from the ensemble of human proteins (see [22] for details about the
dataset). We have detected 15 different communities in the graph, labelled in the figure with different
colours and numbers. By means of a search in biological databases, we can show that linked pairs of
motifs belonging to the same community all co-occur in the same kind of protein domains, and that one
can associate 9 of these 15 communities with a single domain each (see Table I in [22]). These results
are remarkable compared with the current methods to extract functional protein domains, all based on
multi-alignment of sequences, and cannot be obtained if one uses a lower order



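TABLE I. The ten most significant links between motifs, belonging to 7 different communities in the
Twitter dataset [22]. Each community corresponds to a specific tweet or expression that generated a
topic cascade.

motif 1 | motif 2 | significance | Expression or tweet | Topic
9cle | gg27 | 955.3 | GUARDIAN ICM POLL Cameron 35% Brown 29% Clegg 27% | poll results from various websites, journals, tv channels, etc
5bro | wn29 | 894.8 | GUARDIAN ICM POLL Cameron 35% Brown 29% Clegg 27% | poll results from various websites, journals, tv channels, etc
son4 | 4cle | 924.? | Brown wins on 44%, Clegg is second on 42%, Cameron 13% None of them 1% | poll results from various websites, journals, tv channels, etc
don4 | 2cam | 881.7 | Brown wins on 44%, Clegg is second on 42%, Cameron 13% None of them 1% | poll results from various websites, journals, tv channels, etc
lapo | mete | 892.3 | www.slapometer.com | A funny website on the election
swed | nesd | 864.7 | hey Dave, Gordon and Nick: how about a 4th debate on Channel 4 this Wednesday night without the rules?! | Proposal for a 4th debate among leaders, made by a journalist on his twitter page
nesd | ayni | 826.1 | hey Dave, Gordon and Nick: how about a 4th debate on Channel 4 this Wednesday night without the rules?! | Proposal for a 4th debate among leaders, made by a journalist on his twitter page
jami | ncoh | 842.0 | Benjamin Cohen | Journalist of Channel 4 News
mine | ohen | 764.9 | Benjamin Cohen | Journalist of Channel 4 News
isob | eymu | 831.4 | #disobeymurdoch | hashtag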


Markov model, meaning that it is fundamental to take into account both short- and long-range
correlations (for more details on the $k$-motif networks in proteomes, see [22]).

Important information from $k$-motif networks can also be retrieved from datasets of social dialogs
and microblogging websites, like Twitter. Although in these cases a dictionary is, in principle, known
a priori, not all terms used in the Internet language are listed in the dictionary: abbreviations,
"leet language" words, and names of websites or of public figures are just some examples. Moreover,
some expressions or combinations of terms appear more frequently in some periods or contexts due to the
interest in some hot topics. We have found that communities of $k$-motif networks derived from
microblogging sequences in Twitter during the UK Election in April 2010 are able to detect exactly those
hot topics which generate information cascades [S15], as shown in Fig. 1 and Table II of [22]. In
Table I we report the links with the highest significance together with the tweet associated with their
community. Each tweet was the origin of a cascade and can be associated with a specific topic or event
discussed during the election campaign (see [22] for details).

Finally, $k$-motif networks carry important information on sequences of symbols generated from
trajectories of dynamical systems by the so-called "symbolic dynamics" approach [10]. One is able, for
instance, to distinguish ensembles of sequences generated by deterministic maps from those generated by
stochastic processes, by looking at the number of components and communities in the $k$-motif network.
In fact, the method, when applied to sequences generated by deterministic equations that are
increasingly non-linear, still finds short motifs, while the same does not occur for ensembles of random
sequences. Furthermore, we have found that the higher the non-linearity in a conservative deterministic
dynamical system, the more disconnected the corresponding $k$-motif network. In Fig. 2 we show an
example of this behaviour for a well-known two-dimensional area-preserving deterministic map, the
standard map [S22]. Each point in Fig. 2 represents the number of components in the 3-motif network
obtained from an ensemble of trajectories produced for a specific value of the non-linearity parameter
$a$. We observe that the number of components increases with $a$, and this behaviour is similar to that
of the positive Lyapunov exponent of the map, shown in the inset (see also [22]).

FIG. 2. Standard map: number of components in the 3-motif networks (main figure), and the Lyapunov
exponent (inset), as a function of the non-linearity parameter $a$.



Summing up, in this Letter we have introduced a general method to construct networks out of any
symbolic sequential data. The method consists of two steps: first it extracts in a "natural" way the
motifs, i.e. those recurrent short strings which play the same role words do in language; then it
represents correlations of motifs within sequences as a network. Important information from the
original data is embedded in such a network and can be easily retrieved, as shown with different
applications (a biological system, a social dialog and a dynamical system). With respect to previous
linguistic methods, our approach does not need a priori knowledge of a given dictionary, and also
allows one to compare different ensembles, corresponding, for example, to different values of control
parameters in dynamical systems. All this makes the method very general and opens up a wide range of
applications, from the study of written text to the analysis of sheet music or sequences of dance
movements. Moreover, the method does not use parameters on the position of motifs in order to correlate
them, since co-occurrences are computed within sequences, which represent natural interruptions of a
corpus of data (proteins in a proteome, posts in a blog, different initial conditions in a symbolic
dynamics, etc.).
Supplementary material to "Networks of motifs from sequences of symbols"

The properties of $k$-motif networks can reveal important characteristics of the message encrypted in
the original data, just as the analysis of topological quantities (clustering coefficient, average path
length and degree distributions) has helped to understand various linguistic features in networks of
words co-occurring in sentences [S1, S2], and also to model how language has evolved in networks of
conceptually related words [S3]. We present here details and further results on the application of the
method described in the main article to three different datasets: proteomic sequences, short text
messages acquired from Twitter, the well-known social network and microblogging platform, and ensembles
of sequences derived from dynamical trajectories of the standard map by means of a symbolic dynamics
approach.



BIOLOGICAL SEQUENCES 

Methods to study over- or under-representation of particular motifs in a complete genome [S4, S5] or in
a proteome [S6] have already been proposed, and the results have been used to make functional
deductions. Although the information contained in strings deviating from expectancy is useful for the
analysis of many biological mechanisms [S7], it turns out not to be sufficient for a complete and
exhaustive interpretation of the genomic and proteomic message. A fundamental key to its comprehension
is in fact hidden in the correlations among recurrent patterns of strings. The spatial structure of
proteins provides an example: when a protein folds, segments that are distant on the sequence come
close to each other in space. This can happen because two (or more) segments need to physically
interact in order to perform the biological function of the protein. Such a mechanism translates into a
statistical correlation between short motifs of amino acids, which is well captured by an analysis in
terms of $k$-motif networks.



Human proteome 

In our application, we have considered the ensemble of sequences of the human proteome [S8]. It
consists of 34180 amino acid sequences of variable size, with an average length of 481 letters. For this
dataset, we have computed the probabilities $p^{obs}$ and $p^{exp}$ for each of the $20^3 = 8000$
possible strings of three amino acids, and we have selected as 3-motifs the strings satisfying
$\frac{p^{obs}}{p^{exp}} > \left\langle \frac{p^{obs}}{p^{exp}} \right\rangle + 2\sigma$, hence creating
the dictionary $Z_3$ [S9]. The entries of the dictionary are the nodes of the 3-motif network. The node
$X$ is then linked to $Y$ with a directed arc if the number of times that motif $Y$ follows motif $X$
within the same protein is statistically significant, according to the relation:
$\frac{p^{obs}(Y|X)}{p^{exp}(Y|X)} > \left\langle \frac{p^{obs}(Y|X)}{p^{exp}(Y|X)} \right\rangle + 2\sigma$.
The statistical significance $\frac{p^{obs}(Y|X)}{p^{exp}(Y|X)}$ is also the weight of the arc. In this
way we obtain the 3-motif graph of 199 nodes and 1302 directed links, shown in Fig. 1 of the main
article. The graph has 86 isolated nodes (not displayed in the figure), while the remaining 113 nodes
are organized into 10 weak components. The largest component of the graph contains 5 clusters, detected
by means of the MCL algorithm [S10]. Therefore, 15 different communities are present in the graph. In
Table II we report, for each community, the number of nodes and its total internal weight, defined as
the sum of the weights of links between nodes of the community normalized by the sum of the weights of
links incident on nodes of the community. By submitting a query to the Prosite database [S11] we have
obtained, for each pair of connected motifs belonging to the same community, the list of all proteins,
classified by domain, where the two motifs co-occur. The results show that linked pairs of motifs
belonging to the same community all co-occur in the same kind of domains. In addition to this, one can
associate 9 of these 15 communities with a single protein domain each, since the majority of
co-occurrences emerge in proteins matching a well-defined function. In Table II we report, when
possible, the association with a single protein domain, together with the ratio between the number of
times the pair of motifs with the highest weight occurred in that specific domain and the total number
of co-occurrences in the database.

TABLE II. List of communities in the 3-motif network of the human proteome. Community labels as in
Fig. 1 of the main text, number of nodes, total internal weight, associated domain, and the domain
specificity are reported.

FIG. 3. Components of the 4-motif network of the Twitter dataset. Each component and its associated
topic are described in Table III.

Analogous results were also found for the 4-motif graph [S12], while it is not possible to derive the
same kind of information by using lower order Markov models to construct dictionaries. For example, the
3-motif network constructed with a dictionary based on a lower order approximation, rather than on a
two-body Markov chain, exhibits a community structure with just four communities, none of which could
be identified with a functional protein domain.
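
The total internal weight of a community, as defined above (weights of links between nodes of the community, normalized by the weights of all links incident on its nodes), can be computed with a short routine. The function name and the toy weighted network below are hypothetical and serve only to illustrate the definition.

```python
def internal_weight(community, edges):
    """Internal weight of `community` in a weighted directed network.

    `edges` maps (source, target) pairs to weights. The numerator sums the
    links with both endpoints in the community, the denominator sums the links
    with at least one endpoint in it.
    """
    members = set(community)
    inside = sum(w for (x, y), w in edges.items()
                 if x in members and y in members)
    incident = sum(w for (x, y), w in edges.items()
                   if x in members or y in members)
    return inside / incident if incident else 0.0


# Hypothetical toy network: a chain a -> b -> c -> d.
toy_edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
print(internal_weight({"a", "b"}, toy_edges))
print(internal_weight({"c", "d"}, toy_edges))
```

A value close to one indicates a community that is almost disconnected from the rest of the graph, while lower values signal substantial weight on links leaving the community.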



SOCIAL NETWORKS AND MICROBLOGGING 

By means of $k$-motif networks, information can also be retrieved from datasets of social dialogs and
microblogging websites. Although in these cases a dictionary is, in principle, known a priori, not all
terms used in the Internet language are listed in a dictionary: abbreviations, puns, leet language
words [S13], and names of websites or of public figures are just some examples. Moreover, some
expressions or combinations of terms appear more frequently in some periods or contexts due to the
interest in some hot topics. In addition to this, the method of $k$-motif networks turns out to be very
useful in all those contexts where it is necessary to process and compact information from large
amounts of symbolic data. This is the case of the Internet, where the amount of text data provided by
blogs, dialogs in social networks, forums, etc. keeps growing.

In the following, we provide details on how networks of motifs are able to reveal information about hot
topics and cascades [S15, S16] in a dataset extracted from Twitter, a well-known platform for social
networking and microblogging.



Twitter 

Twitter [S14] is a social networking and microblogging service which allows users to send short
messages known as tweets. Tweets are composed only of text, with a strict limit of 140 characters: they
are displayed on the author's profile page and delivered to the author's subscribers, who are also
known as "followers". The dataset we have analyzed is a collection of 28143 tweets, crawled over two
days, the 23rd and 24th of April 2010, and selected through the Twitter Streaming API [S17] if they
contained the string #leadersdebate. The choice of such a keyword, called a hashtag in Twitter, was
aimed at selecting all those tweets concerning the electoral campaign in the UK, where the general
election of the members of the House of Commons would take place two weeks later. We have analyzed the
dataset after removing all blank spaces between words and all symbols that were not numbers or letters
(punctuation, symbols like $, @, *, etc.), and without distinguishing between lower- and upper-case
letters. From these sequences, dictionaries of motifs $Z_3$ and $Z_4$ have been extracted, selecting
respectively the 10% and 1% most significant strings of 3 and 4 letters. As described in the main text,
we have constructed networks whose nodes represent the entries of a dictionary, and an arc is drawn
from the node representing string $X$ to the node standing for string $Y$ if
$p^{obs}(Y|X)/p^{exp}(Y|X)$ is greater than a certain threshold. In Fig. 3 we show the 4-motif network
when the threshold is set equal to 400 (isolated nodes not reported). Such a high threshold is chosen
to have a small network that can be easily visualized and studied. More information can be obtained by
setting the threshold to lower values or by analyzing networks made up of motifs of different length
$k$. By searching the original dataset for the connected motifs, it is possible to associate each
component with a particular tweet which generated a cascade, or with a specific expression related to a
hot topic discussed by users of the microblogging platform. For all components of Fig. 3 we report in
Table III the associated tweet or expression and its meaning. For example, components 1 and 4 can be
associated with two exit polls disclosed on those days by two different journals, and component 6 with
the name "Gillian Duffy", a 65-year-old pensioner involved in a political scandal with British PM
Gordon Brown during the election tour (Brown's remark describing her as a "bigoted woman" was
accidentally recorded and broadcast).
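
The preprocessing of tweets described above (no blank spaces, no symbols other than numbers and letters, no case distinction) can be sketched with a single regular expression. The function name is hypothetical, and restricting letters to the ASCII range a-z is an assumption of this sketch.

```python
import re


def normalise_tweet(text):
    """Lower-case the text and keep only ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", text.lower())


poll = normalise_tweet("GUARDIAN ICM POLL Cameron 35% Brown 29% Clegg 27%")
print(poll)
# Motifs such as "9cle" or "wn29" span what were originally separate words.
print("9cle" in poll, "wn29" in poll)
```

Because spaces and punctuation are removed, motifs can bridge word boundaries, which is how strings such as 9cle, gg27 or wn29 arise from a tweet reporting poll percentages.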



TABLE III. In relation to Fig. 3, we report the number of nodes and links, the tweet or expression
containing the motifs, and the topic associated with each of the 13 communities.



SYMBOLIC DYNAMICS 

Symbolic dynamics is a general method to transform trajectories of dynamical systems into sequences of
symbols. The distinctive feature of symbolic dynamics is that time is measured in discrete intervals,
so that at each time interval the system is in a particular state. Each state is associated with a
symbol, and the evolution of the system is then described by a sequence of symbols. The method turns
out to be very useful in all those cases where system states and time are inherently discrete. When the
time scale of the system or its states are not discrete, one has to adopt a coarse-grained description
of the system. Different initial conditions usually generate different trajectories in phase space,
which map onto different sequences of symbols. A large number of initial conditions produces an
ensemble of sequences whose analysis can be addressed with the method based on networks of motifs, as
described in the main article.

In the following, we will describe the application of the method to the standard map, and we will show
how the related networks of motifs change according to its chaotic behaviour.



Standard Map 

The standard map, also known as the Chirikov map, is a two-dimensional area-preserving chaotic map. It
maps a square with side $2\pi$ onto itself [S22]. It is described by the equations:

$$\begin{aligned}
p_{t+1} &= p_t + a \sin x_t \mod 2\pi \\
x_{t+1} &= x_t + p_{t+1} \mod 2\pi
\end{aligned} \tag{7}$$

where $t$ is the iteration (time) index and $a$ is a real-valued parameter. The map is increasingly
chaotic as $a$ increases (see the inset of Fig. 2 in the main article for a plot of the Lyapunov
exponent as a function of the parameter $a$). For $a = 0$, the map is linear and only periodic and
quasiperiodic orbits are allowed. When trajectories are plotted in phase space (the $xp$ plane),
periodic orbits appear as closed curves, and quasiperiodic orbits as necklaces of closed curves whose
centers lie on another, larger closed curve. Which type of orbit is observed depends on the map's
initial conditions. When the nonlinearity of the map increases, for appropriate initial conditions it
is possible to observe chaotic dynamics.

In order to obtain sequences from the standard map of Eq. (7) by means of the symbolic dynamics
approach [S23], one needs to coarse-grain the phase space, defining a discrete and finite number of
possible states the trajectory can occupy. In this way it is possible to associate a symbol with each
of the possible states and to derive a sequence from the trajectory originating from an initial
condition. We have coarse-grained the phase space into 25 ($5 \times 5$) squares of equal size and we
have derived, for different values of the parameter $a$, $10^4$ sequences of $10^3$ symbols. In other
words, this means following for $10^3$ time steps the trajectories originating from $10^4$ different
initial conditions.
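
The generation of symbolic sequences can be sketched as follows. The sketch assumes the conventional form of the Chirikov standard map, $p_{t+1} = p_t + a \sin x_t$ and $x_{t+1} = x_t + p_{t+1}$ (both modulo $2\pi$), a $5 \times 5$ grid of equal squares, and uniformly random initial conditions. The function names, the integer labelling of cells and the random seed are hypothetical.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def standard_map_sequence(x, p, a, steps, cells=5):
    """Iterate the standard map and record the grid cell visited at each step."""
    symbols = []
    for _ in range(steps):
        # Cell indices along x and p; min() guards against the upper boundary.
        col = min(int(x / TWO_PI * cells), cells - 1)
        row = min(int(p / TWO_PI * cells), cells - 1)
        symbols.append(row * cells + col)  # one of cells*cells symbols
        p = (p + a * math.sin(x)) % TWO_PI
        x = (x + p) % TWO_PI
    return symbols


def ensemble(a, n_sequences, steps, seed=0):
    """Generate one symbolic sequence per random initial condition."""
    rng = random.Random(seed)
    return [standard_map_sequence(rng.uniform(0.0, TWO_PI),
                                  rng.uniform(0.0, TWO_PI), a, steps)
            for _ in range(n_sequences)]


# Small hypothetical run; the analysis in the text uses 10^4 sequences of 10^3 symbols.
sequences = ensemble(a=0.5, n_sequences=4, steps=100)
print(len(sequences), len(sequences[0]), max(max(s) for s in sequences) < 25)
```

Each sequence is a list of integers between 0 and 24, which can be mapped to letters if a string representation is preferred for the motif analysis.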

The idea is that closed or quasiperiodic orbits correspond to correlations between motifs, and
therefore to links in the graph of motifs. When the map becomes more and more chaotic, closed orbits
disappear and, correspondingly, the networks break into many components. In the extreme limit of a
highly chaotic map ($a > 3$), the networks of motifs are completely disconnected, with all nodes
isolated. Nevertheless, this scenario is different from the one generated by stochastic sequences,
since in that case motifs would not be detected, while this still happens in the chaotic map, although
only for small values of $k$. This result is shown in Fig. 2 of the main article, where the number of
components of the 3-motif graphs is plotted as a function of the value $a$ of the map generating the
ensemble. This curve is shown to have the same behaviour as the Lyapunov exponent, as reported in the
inset of the same figure.